ABSTRACT

We compare the performance of two automated classification algorithms, the
k-dimensional tree (kd-tree) and support vector machines (SVMs), in separating
quasars from stars in the Sloan Digital Sky Survey (SDSS) and Two Micron All Sky
Survey (2MASS) catalogs. Both algorithms are trained on subsets of SDSS and 2MASS
objects whose nature is known from spectroscopy. We choose different attribute
combinations as input patterns to train the classifiers using photometric data
only, and present the classification results obtained by the two methods.
Performance metrics such as precision and recall, true positive rate and true
negative rate, F-measure, G-mean and Weighted Accuracy are computed to evaluate
the two algorithms. The study shows that both kd-tree and SVMs are effective
automated algorithms for classifying point sources. SVMs show slightly higher
accuracy, but kd-tree requires less computation time. Given different input
patterns based on various parameters (e.g. magnitudes, color information), we
conclude that both kd-tree and SVMs perform better with fewer features. Our
results also indicate that the highest accuracy is reached using the four colors
(u - g, g - r, r - i, i - z) and the r magnitude based on SDSS model magnitudes.
The classifiers trained by kd-tree and SVMs can be used to solve the automated
classification problems faced by the virtual observatory (VO); moreover, they can
be applied to the photometric preselection of quasar candidates for large survey
projects in order to optimize the efficiency of telescopes.

Key words: Classification, Astronomical databases: miscellaneous, Catalogs, Meth- 
ods: Data Analysis, Methods: Statistical 



1 INTRODUCTION 

In recent years, the volume of astronomical data from surveys at different
wavebands has been increasing rapidly. Astronomy has entered an era of data
avalanche. The efficient analysis of large multi-wavelength astronomical data
relies on data mining tools, which allow the selection, classification,
regression, clustering and even the definition of particular object types within
the databases.

Our primary goal is to perform reliable star-quasar separation. Since both stars
and quasars appear as point sources, separating them is an important issue in
astronomy. In the recent past, much work has been carried out on automated
approaches. Hatziminaoglou et al. (2000) explored a new joint method (avoiding
the usual biases) for distinguishing quasars from stars and galaxies by their
photometry. Wolf et al. (2004) explored a photometric method for identifying
stars, galaxies and quasars in multi-color surveys. McGlynn (2004) used decision
trees to build an online system for automated classification of X-ray sources.
Carballo et al. (2004) selected quasar candidates from combined radio and optical
surveys using neural networks. Suchkov et al. (2005) applied ClassX, an oblique
decision tree classifier optimized for astronomical classification and redshift
estimation in the Sloan Digital Sky Survey (SDSS) photometric catalog. Ball et
al. (2006) classified stars and galaxies in the SDSS DR3 using decision trees.

In this work we investigate the application of support vector machines (SVMs)
and the k-dimensional tree (kd-tree) to the effective selection of quasar
candidates. SVMs have been successfully applied in astronomy mainly to the
following problems: classification of variable stars (Wozniak et al. 2001, 2004),
galaxy morphology classification (Humphreys et al. 2001), solar-flare detection
(Qu et al. 2003), classification of multiwavelength data (Zhang & Zhao 2003,
2004), estimation of photometric redshifts of galaxies (Wadadekar 2005; Wang et
al. 2007) and matching of different object catalogs in astrophysics (Rohde et al.
2005, 2006). Wang et al. (2007) investigated SVMs and Kernel Regression (KR) for
photometric redshift estimation with data from the SDSS Data Release 5 (DR5) and
the Two Micron All Sky Survey (2MASS). The kd-tree method, on the other hand, is
used for indexing the five-dimensional flux space in the SDSS science archive in
order to partition the bulk data (Kunszt et al. 2000). Maneewongvatana & Mount
(2002) presented an empirical analysis of two new splitting methods for kd-trees,
sliding-midpoint and minimum-ambiguity, which were designed to remedy some
deficiencies of the standard kd-tree splitting method for data distributions that
are highly clustered in low-dimensional subspaces. Hsieh et al. (2005) used the
kd-tree algorithm to divide their sample in order to improve the redshift
accuracy of galaxies. Kubica et al. (2007) employed kd-trees for efficient intra-
and inter-night linking of asteroid detections. Gao et al. (2008) introduced some
applications of kd-trees in astronomy. In practice, the estimation of photometric
redshifts is a regression problem. Although SVMs and kd-trees can be applied to
both classification and regression problems, they cannot solve both
simultaneously. For a classification problem the predicted parameter is discrete,
while for a regression problem it is continuous. The methods must therefore be
adjusted to the task at hand.

The structure of this paper is as follows: Section 2 describes the sample
collection and parameter selection. Section 3 briefly introduces kd-tree and
SVMs. Section 4 presents the results and discussion, and the conclusion is given
in Section 5.



2 SAMPLE AND PARAMETER SELECTION 

The Sloan Digital Sky Survey (SDSS, York et al. 2000) uses a dedicated,
wide-field, 2.5 m telescope at Apache Point Observatory, New Mexico. Imaging is
carried out in drift-scan mode using a 142 mega-pixel camera in five broad bands,
u g r i z, spanning the range from 3000 to 10,000 Å. The corresponding magnitude
limits for the five bands are 22.0, 22.2, 22.2, 21.3 and 20.5, respectively. The
Fifth Data Release (DR5) of the SDSS includes all survey-quality data taken
through June 2005 and represents the completion of the SDSS-I project. It
includes five-band photometric data for 215 million unique objects selected over
8000 deg^2, and 1,048,960 spectra of galaxies, quasars, and stars selected from
5740 deg^2 of that imaging data. The magnitude limits for the spectroscopic
samples are r(Petrosian) = 17.77 for galaxies, i(PSF) = 19.1 for quasars with
redshifts up to 2.3, and i(PSF) = 20.1 for quasars with higher redshifts.

The Two Micron All Sky Survey (2MASS) project (Cutri et al. 2003) is designed to
close the gap between our current technical capability and our knowledge of the
near-infrared sky. 2MASS uses two new, highly automated 1.3 m telescopes, one at
Mt. Hopkins, AZ, and one at CTIO, Chile. Each telescope is equipped with a
three-channel camera, each channel consisting of a 256×256 array of HgCdTe
detectors, capable of observing the sky simultaneously at J (1.25 μm), H
(1.65 μm) and K_s (2.17 μm), to a 3σ limiting sensitivity of 17.1, 16.4 and 15.3
mag in the three bands. The number of 2MASS point sources adds up to 470,992,970.

We collected photometric data for quasars and stars with spectroscopic
measurements from SDSS DR5, and then cross-identified these objects with the
2MASS database within a 2 arcsec radius using the federation system XMaS.VO.
XMaS.VO is developed by the China-VO project and is mainly used to automate the
creation of databases and the cross-identification of catalogues from different
bands (Gao et al. 2008). The resulting samples are shown in Table 1. Almost every
SDSS object in our sample has a counterpart in the 2MASS database; fewer than 100
records are missing for each class (Table 1). In our work, both SDSS and 2MASS
magnitudes are available for all objects under consideration, so the issue of
inhomogeneous coverage or non-detections is not addressed here.

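The 2 arcsec positional match described above can be illustrated with a minimal
sketch. This is not the XMaS.VO implementation, whose internals are not described
here; the function names, the dictionary-based catalogs and the brute-force search
are assumptions made purely for illustration, and the coordinates are invented.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation of two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def cross_match(sdss, twomass, radius_arcsec=2.0):
    """Map each SDSS id to the nearest 2MASS id within the matching radius.

    Brute force, O(N*M): adequate for a demonstration, not for full catalogs.
    """
    matches = {}
    for sid, (ra, dec) in sdss.items():
        best_id, best_sep = None, radius_arcsec
        for tid, (ra2, dec2) in twomass.items():
            sep = angular_separation_arcsec(ra, dec, ra2, dec2)
            if sep <= best_sep:
                best_id, best_sep = tid, sep
        if best_id is not None:
            matches[sid] = best_id
    return matches

# Hypothetical positions (RA, Dec in degrees), for illustration only.
sdss = {"s1": (150.0, 2.0), "s2": (151.0, 2.5)}
twomass = {"t1": (150.0003, 2.0), "t2": (151.01, 2.5)}
matches = cross_match(sdss, twomass)
```

In this toy example only the first object has a counterpart within the radius;
the second candidate lies about 36 arcsec away and is rejected.
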
In order to study the distribution of stars and quasars in the multi-dimensional
space, we use different magnitudes: PSF magnitudes (u^p, g^p, r^p, i^p, z^p),
model magnitudes (u, g, r, i, z) and model magnitudes with reddening correction
(u', g', r', i', z'; hereafter dereddened magnitudes) from the SDSS data, and J,
H and K_s magnitudes from the 2MASS catalog. The dereddened magnitudes are
corrected for Galactic extinction using the dust maps of Schlegel et al. (1998).
J, H and K_s are the selected "default" magnitudes for each band. If the source
is not detected in a band, the value is the 95% confidence upper limit derived
from a 4 arcsec radius aperture measurement taken at the position of the source
on the Atlas Image. We explored different input patterns composed of these
attributes.

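As a small illustration of how such an input pattern is assembled, the sketch
below builds the color-based pattern (u - g, g - r, r - i, i - z, r) from five
magnitudes. The function name is hypothetical, and the numbers are simply the
mean quasar model magnitudes of Table 2 used as sample values.

```python
def color_pattern(u, g, r, i, z):
    """Return the (u-g, g-r, r-i, i-z, r) input pattern from five magnitudes."""
    return (u - g, g - r, r - i, i - z, r)

# Sample values only: mean quasar model magnitudes from Table 2.
pattern = color_pattern(19.74, 19.23, 18.96, 18.81, 18.71)
```

The same function applies unchanged to PSF or dereddened magnitudes, since only
the differences between adjacent bands and the r magnitude are used.
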
The mean values of the features selected as input patterns are given in Table 2,
which shows the statistical properties of the samples. The first, second and
third columns give the number, name and description of the parameters,
respectively. The following columns list the mean values of the parameters with
standard errors for quasars and stars. The mean values of the parameters clearly
differ between the two classes of objects, especially for the color indices. It
is therefore reasonable to discriminate quasars from stars with these features.
In order to investigate the distribution of the different objects in 2D scatter
plots, we randomly select some parameters and subsamples of quasars and stars for
visual inspection and plot them in Figure 1.

Table 1. The number of samples from different catalogs

Catalog        Number of Quasars    Number of Stars
SDSS           76,949               108,744
SDSS+2MASS     76,863               108,679

Taking the pattern (u - g, g - r, r - i, i - z, r) as an example, we apply
principal component analysis (PCA) to the sample. PCA is a statistical method
that permits the determination of the minimum number of independent or
uncorrelated variables underlying a larger number of observed variables (Kendall
1957; Kendall & Stuart 1966). PCA is thus used as a technique for both data
compression and analysis; in addition, it can be used as an unsupervised method
for classification. For applications of PCA in astronomy, we refer to e.g.
Connolly & Szalay (1999, and references therein) or Zhang & Zhao (2003). The
result of PCA shows that the first three eigenvectors carry 99.30%, 0.41% and
0.17%, respectively, of the descriptive power. This means that the first three
vectors carry most of the information, especially the first one. We study the
distribution of quasars and stars in the principal component space. For
simplicity, a subsample randomly selected from the overall sample is shown in
Figure 1. PC1, PC2 and PC3 are short for the first, second and third principal
components, respectively.

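The "descriptive power" carried by the first eigenvector is the ratio of the
largest eigenvalue of the covariance matrix to the sum of all eigenvalues (the
trace). The following sketch estimates this ratio with plain power iteration. It
is an illustration under stated assumptions, not the procedure used for the
numbers above: the function names are hypothetical, the data are synthetic, and
the starting vector is assumed not to be orthogonal to the leading eigenvector.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature tuples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def leading_variance_fraction(rows, iterations=200):
    """Fraction of the total variance carried by the first principal component."""
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / d ** 0.5] * d  # unit starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # every feature is constant
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v approximates the largest eigenvalue.
    lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    trace = sum(cov[a][a] for a in range(d))
    return lam / trace

# Synthetic, strongly correlated 3-feature data (not survey data).
rows = [(k / 10, 2 * k / 10 + 0.01 * ((7 * k) % 5 - 2), 0.01 * ((3 * k) % 7 - 3))
        for k in range(50)]
fraction = leading_variance_fraction(rows)
```

For data dominated by a single direction of variation, as in this synthetic
example, the fraction is close to 1, analogous to the 99.30% quoted above.
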
It is obvious from Figure 1 that quasars and stars are not easy to discriminate
from each other because they overlap in the two-color diagrams and in the
principal component space. We therefore need to rely on machine learning or data
mining techniques to separate quasars from stars in the high-dimensional space.



3 THE CLASSIFICATION ALGORITHMS 
3.1 Kd-tree 

A k-dimensional tree (kd-tree) is, in computer science, a space-partitioning data
structure for organizing points in a k-dimensional space (Bentley 1975). For more
information about kd-trees, we refer to http://en.wikipedia.org/wiki/Kdtree. A
kd-tree uses only splitting planes that are perpendicular to one of the
coordinate axes. In the typical definition, every node of a kd-tree, from the
root to the leaves, stores a point; as a consequence, each splitting plane must
go through one of the points in the kd-tree. In an alternative definition, the
points are stored in the leaf nodes only, although each splitting plane still
goes through one of the points. Technically, the letter k refers to the number of
dimensions, so a 3-dimensional kd-tree can be called a 3d-tree. A graphical
representation of a 3d-tree is shown in Figure 2. A kd-tree organizes a set of
data points in k-dimensional space in such a way that, once it is built, a query
requesting all points in a neighborhood can be answered quickly without scanning
every single point. Each tree node represents a subvolume of the parameter space,
with the root node containing the entire k-dimensional volume spanned by the
data. Non-leaf nodes have two children, obtained by splitting the widest
dimension of the parent's bounding box: the left child owns the data points that
are strictly less than the splitting value in the splitting dimension, and the
right child owns the remainder of the parent's data points. A kd-tree is usually
constructed top-down, beginning with the full set of points and splitting at the
center of the widest dimension. This produces two child nodes, each with a
distinct set of points. The complete kd-tree is constructed by repeating the
procedure recursively.

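The top-down construction described above can be sketched as follows. This is a
minimal illustration, not the Java package used in our experiments: points are
stored in leaf nodes only, each split is made at the center of the widest
dimension of the bounding box, and the class and function names (Node, build,
nearest) are assumptions introduced here.

```python
class Node:
    """Leaf if `points` is set; otherwise an internal node with a split plane."""
    def __init__(self, points=None, dim=None, value=None, left=None, right=None):
        self.points = points
        self.dim, self.value = dim, value
        self.left, self.right = left, right

def build(points, leaf_size=1):
    """Recursively split at the center of the widest bounding-box dimension."""
    k = len(points[0])
    lows = [min(p[d] for p in points) for d in range(k)]
    highs = [max(p[d] for p in points) for d in range(k)]
    dim = max(range(k), key=lambda d: highs[d] - lows[d])
    if len(points) <= leaf_size or highs[dim] == lows[dim]:
        return Node(points=list(points))
    value = (lows[dim] + highs[dim]) / 2.0
    left = [p for p in points if p[dim] < value]    # strictly less than the split
    right = [p for p in points if p[dim] >= value]  # the remainder
    if not left or not right:  # guard against floating-point degeneracy
        return Node(points=list(points))
    return Node(dim=dim, value=value,
                left=build(left, leaf_size), right=build(right, leaf_size))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Nearest stored point to `query`, pruning subtrees that cannot contain it."""
    if node.points is not None:
        for p in node.points:
            if best is None or sq_dist(p, query) < sq_dist(best, query):
                best = p
        return best
    if query[node.dim] < node.value:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best = nearest(near, query, best)
    # Search the far side only if the split plane is closer than the best match.
    if best is None or (query[node.dim] - node.value) ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
closest = nearest(tree, (9, 2))
```

The pruning step in `nearest` is what allows a neighborhood query to be answered
without scanning every point.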


3.2 Support Vector Machines 

Support vector machines (SVMs) are a set of related supervised learning methods
used for classification and regression. SVMs can be considered a special case of
Tikhonov regularization. The idea of SVMs is to map input vectors nonlinearly
into a high-dimensional feature space and to construct the optimal separating
hyperplane in that space. SVMs were originally developed by Vapnik (1995) and
became popular because of their many attractive features and promising empirical
performance. SVMs have various parameters that can be tuned for optimal
performance, including the kernel function. Popular kernels include the linear,
polynomial and radial basis function kernels. SVMs also allow the soft margin to
be adjusted; this parameter controls the trade-off between smooth and overly
complex functions. Controlling this trade-off is necessary to obtain good
generalization. Functions that represent the training data well but do not
generalize to novel examples are said, in machine learning terminology, to
overfit the data. The soft margin is a tool that allows SVMs to avoid overfitting
(Rohde et al. 2005).

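To make the role of the kernel concrete, the sketch below evaluates a radial
basis function kernel and the textbook form of a kernel SVM decision function,
f(x) = sum_i coeff_i K(sv_i, x) + bias. It does not train anything: the support
vectors, coefficients and bias are hypothetical values chosen for illustration,
and the sign convention of the bias may differ between SVM packages.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def decision_value(x, support_vectors, coeffs, bias, gamma):
    """Kernel SVM decision function; each coeff is alpha_i * y_i."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coeffs)) + bias

# Hypothetical two-support-vector model: positive class near (0, 0),
# negative class near (1, 1).
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
coeffs = [1.0, -1.0]
positive_side = decision_value((0.1, 0.1), support_vectors, coeffs, 0.0, 5.0)
negative_side = decision_value((0.9, 0.9), support_vectors, coeffs, 0.0, 5.0)
```

The sign of the decision value gives the predicted class; a larger gamma makes
each support vector's influence more local.
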
3.3 Performance Measurement 

Besides the overall classification accuracy, we use metrics such as true negative
rate, true positive rate, Weighted Accuracy (WA), G-mean (GM), precision, recall
and F-measure (FM) to evaluate the performance of the classification algorithms
(Chen & Liaw 2004). These metrics have been widely used for the comparison of
different classifiers. All of them are functions of the confusion matrix shown in
Table 3, where TP is short for true positive, FN for false negative, FP for false
positive and TN for true negative. In the process of classification, quasars are
labeled as positive and stars as negative. The rows of the matrix are the actual
classes, and the columns are the predicted classes. Based on Table 3, the
above-mentioned metrics are defined as follows:



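$$\mathrm{Accuracy}\ (Acc) = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

$$\mathrm{True\ Positive\ Rate}\ (Acc^{+}) = \frac{TP}{TP + FN} = \mathrm{Recall} \tag{2}$$

$$\mathrm{True\ Negative\ Rate}\ (Acc^{-}) = \frac{TN}{TN + FP} \tag{3}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{F\text{-}measure}\ (FM) = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$

$$\mathrm{G\text{-}mean}\ (GM) = (Acc^{-} \times Acc^{+})^{1/2} \tag{6}$$

$$\mathrm{Weighted\ Accuracy}\ (WA) = \beta \times Acc^{+} + (1 - \beta) \times Acc^{-} \tag{7}$$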

Recall is the fraction of actual positive cases that are correctly identified,
and precision is the fraction of predicted positive cases that are actually
positive. For any classifier there is always a trade-off between recall and
precision. The geometric mean (G-mean) averages the true positive and true
negative rates multiplicatively, so it is high only when both rates are high. The
F-measure is the harmonic mean of precision and recall. Weighted Accuracy uses an
adjustable parameter β to suit different applications. Here we use equal weights
for both true positive rate and true negative rate, i.e., β equals 0.5. These
metrics are commonly used in the information retrieval area as performance
measures. We adopt all of them to compare our methods with different input
patterns. Train-test and ten-fold cross-validation were carried out to obtain all
the performance metrics.

Table 2. The mean values of parameters for the samples.

No.  Parameter    Description                          Quasars         Stars
1    u^p          SDSS PSF u magnitude                 19.78 ± 1.38    20.78 ± 2.57
2    g^p          SDSS PSF g magnitude                 19.31 ± 0.89    19.32 ± 2.23
3    r^p          SDSS PSF r magnitude                 19.06 ± 0.76    18.56 ± 1.81
4    i^p          SDSS PSF i magnitude                 18.89 ± 0.75    18.02 ± 1.51
5    z^p          SDSS PSF z magnitude                 18.80 ± 0.76    17.73 ± 1.48
6    u^p - g^p    SDSS PSF u - g color                 0.48 ± 0.79     1.46 ± 1.04
7    g^p - r^p    SDSS PSF g - r color                 0.25 ± 0.35     0.76 ± 0.81
8    r^p - i^p    SDSS PSF r - i color                 0.16 ± 0.20     0.54 ± 0.87
9    i^p - z^p    SDSS PSF i - z color                 0.10 ± 0.18     0.30 ± 0.59
10   u            SDSS model u magnitude               19.74 ± 1.41    20.77 ± 2.63
11   g            SDSS model g magnitude               19.23 ± 0.93    19.26 ± 2.24
12   r            SDSS model r magnitude               18.96 ± 0.84    18.49 ± 1.83
13   i            SDSS model i magnitude               18.81 ± 0.85    17.96 ± 1.52
14   z            SDSS model z magnitude               18.71 ± 0.87    17.67 ± 1.50
15   u - g        SDSS model u - g color               0.50 ± 0.81     1.50 ± 1.09
16   g - r        SDSS model g - r color               0.26 ± 0.37     0.78 ± 0.85
17   r - i        SDSS model r - i color               0.17 ± 0.21     0.53 ± 0.89
18   i - z        SDSS model i - z color               0.10 ± 0.18     0.29 ± 0.62
19   u'           SDSS dereddened model u magnitude    19.58 ± 1.41    20.58 ± 2.63
20   g'           SDSS dereddened model g magnitude    19.12 ± 0.93    19.13 ± 2.24
21   r'           SDSS dereddened model r magnitude    18.89 ± 0.84    18.39 ± 1.83
22   i'           SDSS dereddened model i magnitude    18.74 ± 0.85    17.88 ± 1.52
23   z'           SDSS dereddened model z magnitude    18.87 ± 0.87    17.61 ± 1.50
24   u' - g'      SDSS dereddened model u - g color    0.46 ± 0.81     1.45 ± 1.09
25   g' - r'      SDSS dereddened model g - r color    0.23 ± 0.37     0.74 ± 0.85
26   r' - i'      SDSS dereddened model r - i color    0.15 ± 0.21     0.51 ± 0.89
27   i' - z'      SDSS dereddened model i - z color    0.08 ± 0.18     0.27 ± 0.62
28   J            2MASS J magnitude                    15.58 ± 1.39    15.39 ± 1.21
29   H            2MASS H magnitude                    15.00 ± 1.33    14.91 ± 1.21
30   K_s          2MASS K_s magnitude                  14.67 ± 1.23    14.72 ± 1.19
31   J - H        2MASS J - H color                    0.59 ± 0.27     0.48 ± 0.25
32   H - K_s      2MASS H - K_s color                  0.32 ± 0.37     0.19 ± 0.31

Table 3. Confusion matrix.

                         Predicted Positive Class    Predicted Negative Class
Actual Positive class    TP (True Positive)          FN (False Negative)
Actual Negative class    FP (False Positive)         TN (True Negative)

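The metrics of equations (1)-(7) can be computed directly from the four entries
of the confusion matrix. The function below is a minimal sketch with a
hypothetical name and invented counts; it performs no guarding against empty
classes (zero denominators), which a production version would need.

```python
import math

def classification_metrics(tp, fn, fp, tn, beta=0.5):
    """Metrics of equations (1)-(7); quasars are the positive class."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)            # true positive rate = recall (Acc+)
    tnr = tn / (tn + fp)            # true negative rate (Acc-)
    precision = tp / (tp + fp)
    fm = 2 * precision * tpr / (precision + tpr)
    gm = math.sqrt(tnr * tpr)
    wa = beta * tpr + (1 - beta) * tnr
    return {"Acc": acc, "Acc+": tpr, "Acc-": tnr, "Precision": precision,
            "FM": fm, "GM": gm, "WA": wa}

# Invented counts for illustration: 100 quasars and 100 stars.
metrics = classification_metrics(tp=90, fn=10, fp=5, tn=95)
```

With equal weights (β = 0.5), Weighted Accuracy is the arithmetic mean of the
two rates and G-mean is their geometric mean, so G-mean never exceeds Weighted
Accuracy.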


4 RESULTS AND DISCUSSION 

Our experiments are performed using the kd-tree Java package
(http://www.cs.wlu.edu/~levy/software/kd/) written by Simon D. Levy and
SVMLight, an implementation of SVMs in C (http://svmlight.joachims.org/). The
computer used was a PC running Microsoft Windows XP with a 3.2 GHz Pentium(R) 4
CPU and 1.00 GB of memory. One advantage of the empirical training set approach
to classification is that additional parameters can easily be incorporated as
inputs. In order to study which parameters influence the classification accuracy,
we probe different input patterns to separate quasars from stars and compare the
performance of kd-tree and SVMs for each. Our experimental results are shown in
Tables 4-8. For all experiments we calculate accuracy, true positive rate, true
negative rate, precision, F-measure, G-mean, Weighted Accuracy and running time,
and we apply these criteria to determine which pattern is best. Here we take
quasars as the positive class and stars as the negative one. For Weighted
Accuracy, we adopt equal weights for both true positive rate and true negative
rate (β equals 0.5).



4.1 Results of kd-tree 

Figure 1. Scatter plots of a random subsample (filled squares represent quasars;
open ones represent stars): the upper four diagrams are color-color diagrams; the
lower two diagrams are PC1 vs. PC2 and PC1 vs. PC3.

Firstly, we explore kd-tree to isolate quasars from stars with different input
patterns. Each sample is randomly divided into two parts: two thirds for training
a classifier and one third for testing it to obtain the classification rate. This
is usually called the train-test method. The number of samples differs between
input patterns. The label Q (for quasars) or S (for stars) is attached to each
sample as an input index and used to build a kd-tree classifier in a supervised
way. We then use the test samples to find the optimal number n of nearest
neighbors. For each test sample, the prediction is counted as correct if more
than half of its n nearest neighbors carry the same label as the test sample, and
as incorrect otherwise. The value of n must therefore be an odd integer to avoid
ties. In theory, higher values of n provide smoothing that reduces vulnerability
to noise in the training data. In practical applications n is typically in units
or tens rather than in hundreds or thousands. We set n = 11 for this experiment
because of its higher accuracy. The magnitudes in five bands (u, g, r, i, z) are
taken as the first set of input parameters for kd-tree, and then the four color
indices (u - g, g - r, r - i, i - z) together with the r magnitude are used as
the input pattern. In principle, including more parameters provides more
information for classification. The J, H and K_s magnitudes (or the J - H and
H - K_s colors) from the 2MASS catalog are therefore added as extra inputs to
build our classifier. We compare the performance of different input patterns
based on PSF magnitudes, model magnitudes and dereddened magnitudes. The
comparison is listed in Table 4.

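The majority vote among the n nearest neighbors can be sketched as follows. For
brevity this illustration finds the neighbors by brute-force sorting rather than
with a kd-tree, which changes the speed but not the predicted label; the function
name and the toy training points (placed near the mean u - g and g - r colors of
Table 2) are assumptions made for illustration.

```python
from collections import Counter

def knn_predict(train, query, n=11):
    """Label ('Q' or 'S') held by the majority of the n nearest training samples."""
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties between the two classes")
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    votes = Counter(label for _, label in by_distance[:n])
    return votes.most_common(1)[0][0]

# Toy training set in the (u - g, g - r) plane; not survey data.
train = [((0.4, 0.20), "Q"), ((0.5, 0.30), "Q"), ((0.6, 0.25), "Q"),
         ((1.4, 0.70), "S"), ((1.5, 0.80), "S"), ((1.6, 0.90), "S")]
quasar_like = knn_predict(train, (0.45, 0.22), n=3)
star_like = knn_predict(train, (1.55, 0.85), n=3)
```

A prediction is then scored as correct when the returned label equals the
spectroscopic label of the test sample.
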
Table 4 shows that for every input pattern the kd-tree method reaches a rather
high accuracy, more than 94.47%, with high values of F-measure, G-mean and
Weighted Accuracy, and a running time of less than 5 minutes. Generally, for
similar input patterns, those based on model magnitudes yield a higher accuracy
than those based on the other kinds of magnitudes, and those based on dereddened
magnitudes perform better than those based on PSF magnitudes. For all three kinds
of magnitudes, the four colors plus r magnitude outperform the five magnitudes as
input pattern. The accuracy does not always increase when more features are
considered; for example, it tends to decrease when the parameters J, H, K_s or
J - H, H - K_s are added to the input pattern. The best performance is obtained
only when appropriate features are adopted. With fewer features, kd-tree performs
better and takes less time to build. As shown in Table 4, the four model color
indices (u - g, g - r, r - i, i - z) together with the model r magnitude obtain
the highest accuracy, 97.26%, and the highest values of F-measure, G-mean and
Weighted Accuracy, 96.69%, 97.14% and 97.14%, respectively; moreover, the running
time is short, not more than 1 minute.

Figure 2. A 3-dimensional kd-tree. The first split (red) cuts the root cell
(white) into two subcells, each of which is then split (green) into two subcells.
Finally, each of those four is split (blue) into two subcells. Since there is no
more splitting, the final eight are called leaf cells. The yellow spheres
represent the tree vertices.



From the above results, we conclude that the four model colors (u - g, g - r,
r - i, i - z) and the model r magnitude form the best input pattern for kd-tree
when n = 11. We now adopt this input pattern and investigate the influence of n
on the performance of kd-tree, varying n between experiments. By comparing the
accuracy, F-measure, G-mean and Weighted Accuracy of classification and the
running time taken to build a classifier, we estimate the efficiency and
effectiveness of the classifiers created with different n values, and thereby
obtain the optimal number of nearest neighbors. We consider odd integers n from 3
to 29. Table 5 indicates that the highest classification accuracy is 97.263% when
n = 11, followed by 97.262% and 97.252% when n = 7 and n = 9, respectively. The
running time increases with n. For n = 7, F-measure, G-mean and Weighted Accuracy
simultaneously reach their highest values, 96.690%, 97.149% and 97.151%,
respectively. The true positive rate is also higher for n = 7 than for n = 9 or
11, which means that the classifier built with n = 7 gives high prediction
accuracy for the quasar class while maintaining reasonable accuracy for the star
class.



4.2 Results of SVMs 

Since the best input pattern for kd-tree is the four model colors (u - g, g - r,
r - i, i - z) and the model r magnitude, we apply the same input pattern to
create the SVM classifier. We choose the radial basis function (RBF) as the
kernel function. RBF SVMs have two adjustable parameters: γ, the parameter of the
RBF kernel, and c, the trade-off between training error and margin. We compare
the classifiers created with different values of these two parameters; the
results are listed in Table 6. The best accuracy (97.50%) is obtained using the
RBF SVM classifier with γ = 5 and c = 1 or 5, with building times of 21 min and
28 min, respectively. However, the highest F-measure is 97.78% with γ = 8 and
c = 1, and the highest values of G-mean and Weighted Accuracy are 97.41% and
97.41% with γ = 5 and c = 5. We also find that the true positive rate for γ = 5
and c = 5 is superior to that for γ = 5 and c = 1, although the former takes more
training time. Table 6 shows that smaller c values generally require less running
time: building the SVM classifier takes the least time when γ = 5 and c = 0.1,
whereas with γ = 0.01 and c = 1000 it takes 1 day and 14 hours.

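For readers who wish to repeat such a parameter scan with SVMLight, the sketch
below writes one training example in the SVMLight input format (target followed
by index:value pairs) and assembles an `svm_learn` command line in which `-t 2`
selects the RBF kernel, `-g` sets γ and `-c` sets the trade-off c. The helper
names, file names and feature values are hypothetical, and the exact commands
used for Table 6 are not given in this paper.

```python
def to_svmlight_line(features, is_quasar):
    """One example in SVMLight format: '+1' for quasars, '-1' for stars."""
    target = "+1" if is_quasar else "-1"
    pairs = " ".join(f"{idx}:{value:.6g}"
                     for idx, value in enumerate(features, start=1))
    return f"{target} {pairs}"

def svm_learn_command(train_file, model_file, gamma, c):
    """Argument list for svm_learn with an RBF kernel (file names are placeholders)."""
    # -t 2: RBF kernel; -g: gamma of the RBF kernel; -c: error/margin trade-off.
    return ["svm_learn", "-t", "2", "-g", str(gamma), "-c", str(c),
            train_file, model_file]

# Sample pattern (u - g, g - r, r - i, i - z, r); values are illustrative.
line = to_svmlight_line((0.5, 0.26, 0.17, 0.1, 18.96), is_quasar=True)
command = svm_learn_command("train.dat", "model.dat", gamma=5, c=5)
```

The argument list could be passed to a process launcher once for every (γ, c)
pair of interest, timing each run to obtain the building time.
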
On account of the highest values of G-mean and Weighted Accuracy in Table 6, the
optimal values of γ and c are both 5 for RBF SVMs. Setting γ = 5 and c = 5, we
compute accuracy, true positive rate, true negative rate, precision, F-measure,
G-mean and Weighted Accuracy for different input patterns using RBF SVMs
(Table 7). Based on accuracy, F-measure, G-mean and Weighted Accuracy, the input
pattern (u - g, g - r, r - i, i - z, r) is clearly the optimal one. With the four
colors and r magnitude (u - g, g - r, r - i, i - z, r) as the input pattern, SVMs
perform better than with the five magnitudes (u, g, r, i, z). Similar to the
result for kd-tree, the performance based on the model magnitudes is better than
that based on the dereddened magnitudes, which in turn is superior to that based
on the PSF magnitudes. Adding more parameters from the near-infrared bands does
not improve the performance and can even degrade it. This shows that J - H and
H - K_s contribute little information for classification.

Table 4. The comparison of different input patterns using kd-tree when n = 11.

Input patterns                                        Acc     Acc+    Acc-    Precision  FM      GM      WA      Time
                                                      (%)     (%)     (%)     (%)        (%)     (%)     (%)     (s)
u^p, g^p, r^p, i^p, z^p                               96.32   95.34   97.02   95.76      95.55   96.17   96.18   27
u^p-g^p, g^p-r^p, r^p-i^p, i^p-z^p, r^p               97.02   96.24   97.58   96.56      96.40   96.90   96.91   18
u^p, g^p, r^p, i^p, z^p, J-H, H-K_s                   95.82   94.90   96.48   95.01      94.96   95.69   95.69   83
u^p-g^p, g^p-r^p, r^p-i^p, i^p-z^p, r^p, J-H, H-K_s   96.62   95.89   97.14   95.96      95.92   96.51   96.52   166
u^p, g^p, r^p, i^p, z^p, J, H, K_s                    94.81   93.91   95.45   93.58      93.75   94.67   94.68   90
u^p-g^p, g^p-r^p, r^p-i^p, i^p-z^p, r^p, J, H, K_s    95.76   95.08   96.24   94.70      94.89   95.66   95.66   166
u, g, r, i, z                                         96.46   95.42   97.20   96.02      95.72   96.31   96.31   26
u-g, g-r, r-i, i-z, r                                 97.26   96.41   97.87   96.97      96.69   97.14   97.14   54
u, g, r, i, z, J-H, H-K_s                             95.85   94.90   96.53   95.09      94.99   95.71   95.72   91
u-g, g-r, r-i, i-z, r, J-H, H-K_s                     96.76   96.02   97.28   96.15      96.08   96.65   96.65   148
u, g, r, i, z, J, H, K_s                              94.87   93.88   95.57   93.75      93.81   94.72   94.73   91
u-g, g-r, r-i, i-z, r, J, H, K_s                      95.85   95.09   96.39   94.90      95.00   95.74   95.74   167
u', g', r', i', z'                                    96.41   95.42   97.10   95.88      95.65   96.26   96.26   25
u'-g', g'-r', r'-i', i'-z', r'                        97.19   96.37   97.76   96.82      96.60   97.07   97.07   44
u', g', r', i', z', J-H, H-K_s                        95.8    95.00   96.47   95.01      95.00   95.73   95.74   81
u'-g', g'-r', r'-i', i'-z', r', J-H, H-K_s            96.68   95.97   97.18   96.00      95.99   96.57   96.57   151
u', g', r', i', z', J, H, K_s                         94.73   93.78   95.41   93.52      93.65   94.59   94.59   89
u'-g', g'-r', r'-i', i'-z', r', J, H, K_s             95.76   95.04   96.27   94.74      94.89   95.65   95.65   178

Table 5. The comparison of different n values for kd-tree with four model colors
(u - g, g - r, r - i, i - z) and model r magnitude as input pattern.

n     Acc      Acc+     Acc-     Precision   FM       GM       WA       Time
      (%)      (%)      (%)      (%)         (%)      (%)      (%)      (s)
3     97.167   96.479   97.654   96.677      96.578   97.065   97.066   29
5     97.239   96.554   97.723   96.775      96.665   97.137   97.139   35
7     97.262   96.503   97.799   96.877      96.690   97.149   97.151   41
9     97.252   96.452   97.818   96.902      96.677   97.133   97.135   46
11    97.263   96.413   97.866   96.966      96.689   97.137   97.140   50
13    97.228   96.412   97.804   96.882      96.646   97.106   97.108   54
15    97.187   96.342   97.785   96.853      96.597   97.061   97.064   57
17    97.130   96.227   97.768   96.826      96.526   96.994   96.998   61
19    97.102   96.153   97.774   96.831      96.491   96.960   96.964   64
21    97.081   96.149   97.740   96.785      96.466   96.941   96.945   67
23    97.064   96.113   97.737   96.780      96.445   96.922   96.925   70
25    97.043   96.082   97.723   96.760      96.420   96.899   96.903   73
27    97.004   96.023   97.698   96.724      96.372   96.857   96.861   76
29    96.968   95.956   97.684   96.702      96.328   96.816   96.820   78

Finally, we use 10-fold cross-validation to evaluate the
performance of kd-tree and SVMs and the time needed to build
the classifiers, adopting the four colors (u - g, g - r,
r - i, i - z) and the r magnitude as the input pattern in order
to compare the two methods. Cross-validation is a generally
applicable and very useful technique for many tasks often
encountered in machine learning, such as accuracy estimation,
feature selection and parameter tuning, and it can be used with
a wide range of learning approaches, including kd-tree and
SVMs. K-fold cross-validation is an important variant suited to
data sets of moderate size. The data are randomly partitioned
into k subsamples. Each time, one of the subsamples is retained
as the testing data, and the remaining k - 1 subsamples are put
together to form the training data. The process is repeated k
times, so that each subsample serves once as the testing data,
and the mean error and the mean value of each metric across all
trials are then computed. The comparison of the efficiency and
effectiveness of the two methods is based on the true negative
rate, true positive rate, Weighted Accuracy, G-mean, precision,
recall and F-measure.
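
The k-fold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: `k_fold_indices`, `cross_validate` and the callback `score_fn` are hypothetical names, and the standard error is taken here as the sample standard deviation of the fold scores divided by the square root of k, which is one common convention.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition range(n_samples) into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(
    n_samples: int,
    k: int,
    score_fn: Callable[[Sequence[int], Sequence[int]], float],
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean score, standard error) over k folds, with k >= 2.

    score_fn(train_idx, test_idx) is supplied by the caller: it trains a
    classifier on the training indices and returns its score on the test ones.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # The other k - 1 folds are merged to form the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fn(train_idx, test_idx))
    mean = sum(scores) / k
    variance = sum((s - mean) ** 2 for s in scores) / (k - 1)
    return mean, math.sqrt(variance / k)
```

Because every object is used once for testing and k - 1 times for training, the procedure costs roughly k times as much as a single train-test run.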

Given the metrics in Table 8, the best result with respect to
accuracy alone is obtained using the RBF SVM classifier with
γ = 5 and c = 5, for which the classification accuracy is
97.65% and the standard error is 0.22%. We also obtain an
accuracy of 97.65% when γ = 2 and c = 10, and 97.64% when
γ = 5 and c = 1; the standard error of the last case (0.20%) is
the smallest among these three best cases. The highest
F-measure is 98.01% with γ = 10 and c = 1; the highest G-mean
and Weighted Accuracy values are 97.55% and 97.56%,
respectively. Kd-tree obtains its best result when n = 9: the
highest values of accuracy, F-measure, G-mean and Weighted
Accuracy are 97.45%, 97.66%, 97.32% and 97.32%, respectively,
which are slightly better than those for n = 11. Based on the
metrics in Table 8, we can hardly tell the difference between
kd-tree and SVMs: SVMs are slightly better than kd-tree in
G-mean, F-measure and Weighted Accuracy. Since the accuracy of
both learning algorithms exceeds 97.0%, the two methods are
effective classifiers for isolating quasars from stars.

Table 6. The comparison of different options using RBF SVMs with four model colors (u - g, g - r, r - i, i - z) and model r magnitude
as input pattern.

Algorithm     Soft margin  Acc    Acc+   Acc-   Precision  FM     GM     WA     Time
(RBF kernel)  c            (%)    (%)    (%)    (%)        (%)    (%)    (%)    (m)
γ = 5         c = 0.1      96.85  95.14  98.07  97.22      97.64  96.59  96.60  17
γ = 0.01      c = 1        92.09  88.69  94.50  91.95      93.21  91.55  91.60  32
γ = 5         c = 1        97.50  96.67  98.09  97.27      97.68  97.38  97.38  21
γ = 8         c = 1        97.48  96.50  98.17  97.39      97.78  97.33  97.34  30
γ = 0.01      c = 10       92.90  89.37  95.40  93.22      94.30  92.34  92.39  47
γ = 5         c = 10       97.41  96.86  97.79  96.88      97.33  97.32  97.33  69
γ = 5         c = 5        97.50  96.88  97.93  97.07      97.50  97.41  97.41  49
γ = 8         c = 5        97.38  96.61  97.92  97.04      97.48  97.26  97.26  54
γ = 0.01      c = 1000     94.55  94.26  96.02  92.47      93.60  94.50  94.50  2280

Table 7. The comparison of different input patterns using RBF SVM when c = 5 and γ = 5.

Input patterns                                    Acc    Acc+   Acc-   Precision  FM     GM     WA     Time
                                                  (%)    (%)    (%)    (%)        (%)    (%)    (%)    (m)
u^p, g^p, r^p, i^p, z^p                           96.97  96.47  97.40  96.97      97.19  96.94  96.94  58
u^p - g^p, g^p - r^p, r^p - i^p, i^p - z^p, r^p   97.39  96.95  97.77  97.39      97.58  97.36  97.36  42
u, g, r, i, z                                     97.15  96.64  97.59  97.16      97.37  97.11  97.11  61
u - g, g - r, r - i, i - z, r                     97.50  96.97  97.93  97.49      97.71  97.45  97.45  14
u - g, g - r, r - i, i - z, r, J - H, H - K_s     97.17  96.16  97.93  97.18      97.55  97.04  97.04  62
u', g', r', i', z'                                97.15  96.72  97.53  97.16      97.34  97.12  97.12  60
u' - g', g' - r', r' - i', i' - z', r'            97.47  97.02  97.85  97.47      97.66  97.44  97.44  43
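
For reference, the metrics reported in Tables 5-8 can be computed from the four confusion-matrix counts as sketched below. The definitions are the usual ones and are an assumption here, in particular the equal weighting (beta = 0.5) in Weighted Accuracy. They do reproduce the G-mean, Weighted Accuracy and F-measure of the n = 3 row of Table 5 from its Acc+, Acc- and Precision entries to within rounding. The function names are illustrative.

```python
import math
from typing import Dict


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def g_mean(tpr: float, tnr: float) -> float:
    """Geometric mean of the true positive and true negative rates."""
    return math.sqrt(tpr * tnr)


def weighted_accuracy(tpr: float, tnr: float, beta: float = 0.5) -> float:
    """beta * Acc+ + (1 - beta) * Acc-; beta = 0.5 is assumed here."""
    return beta * tpr + (1 - beta) * tnr


def metrics_from_counts(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute all metrics from confusion-matrix counts of the positive class."""
    tpr = tp / (tp + fn)          # Acc+, also called recall
    tnr = tn / (tn + fp)          # Acc-
    precision = tp / (tp + fp)
    return {
        "Acc": (tp + tn) / (tp + fn + tn + fp),
        "Acc+": tpr,
        "Acc-": tnr,
        "Precision": precision,
        "FM": f_measure(precision, tpr),
        "GM": g_mean(tpr, tnr),
        "WA": weighted_accuracy(tpr, tnr),
    }


# Consistency check against the n = 3 row of Table 5 (values in percent).
print(g_mean(96.479, 97.654))             # table lists GM = 97.065
print(weighted_accuracy(96.479, 97.654))  # table lists WA = 97.066
print(f_measure(96.677, 96.479))          # table lists FM = 96.578
```

G-mean and Weighted Accuracy both combine Acc+ and Acc-, so they are less sensitive than plain accuracy to an imbalance between the numbers of quasars and stars.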

4.3 Performance Comparison of kd-tree and SVMs

From the tables above we conclude that kd-tree and SVMs are
comparable in accuracy when separating quasars from stars.
Considering running time alone, kd-tree is much faster than
SVMs: the time for kd-tree is measured in seconds while that
for SVMs is measured in minutes, as shown in Tables 4-7. Taking
into account both accuracy and speed, kd-tree is superior,
because building the SVM classifier is very slow. Moreover, the
10-fold cross-validation method gives higher accuracy than the
train-test method, because cross-validation has the advantage
of producing an effectively unbiased error estimate, but it is
computationally expensive (about 10 times longer than the
train-test method). The classifiers trained with kd-tree and
SVMs can therefore be used to classify unclassified sources and
to preselect quasar candidates from SDSS and other survey
catalogs.
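
To make the kd-tree classifier concrete, the sketch below builds a kd-tree over labelled points and classifies a query by majority vote among its n nearest neighbours. It is a minimal pure-Python illustration under stated assumptions, not the implementation used in this work: the function names, the dictionary-based node layout and the toy two-dimensional points are all hypothetical.

```python
import heapq
from collections import Counter


def build_kdtree(points, labels, depth=0):
    """Recursively build a kd-tree; each node splits on one coordinate axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    order = sorted(range(len(points)), key=lambda i: points[i][axis])
    mid = len(order) // 2
    left, right = order[:mid], order[mid + 1:]
    return {
        "point": points[order[mid]], "label": labels[order[mid]], "axis": axis,
        "left": build_kdtree([points[i] for i in left],
                             [labels[i] for i in left], depth + 1),
        "right": build_kdtree([points[i] for i in right],
                              [labels[i] for i in right], depth + 1),
    }


def knn_search(node, query, n, heap=None):
    """Return the n nearest neighbours as a heap of (-dist2, tiebreak, label)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    dist2 = sum((a - b) ** 2 for a, b in zip(node["point"], query))
    item = (-dist2, id(node), node["label"])
    if len(heap) < n:
        heapq.heappush(heap, item)
    elif dist2 < -heap[0][0]:
        # Closer than the current n-th neighbour: replace the farthest one.
        heapq.heapreplace(heap, item)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    knn_search(near, query, n, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current n-th neighbour, which is what makes the search fast.
    if len(heap) < n or diff ** 2 < -heap[0][0]:
        knn_search(far, query, n, heap)
    return heap


def classify(tree, query, n):
    """Majority vote among the n nearest neighbours of query."""
    votes = Counter(label for _, _, label in knn_search(tree, query, n))
    return votes.most_common(1)[0][0]


# Toy two-dimensional points (made up for illustration).
stars = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0)]
quasars = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
tree = build_kdtree(stars + quasars, ["star"] * 3 + ["quasar"] * 3)
print(classify(tree, (0.95, 1.0), n=3))
```

Using an odd n, as in Table 5, avoids tied votes in a two-class problem. The only quantity to tune is n, whereas the RBF SVM requires a joint search over γ and c.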

The sources inclined to be misclassified due to their
intrinsic properties are equally prone to misclassification
whether kd-tree or SVMs is used. Most of the sources
misclassified by kd-tree are also misclassified by SVMs, as
shown by the experimental results. In order to visualize the
classification results, we take the kd-tree method as an
example. In Figure 3, we plot the quasars and the misclassified
quasars as a function of redshift. Figure 3 shows that the peak
of the quasar sample lies in the redshift range 1 to 2, while
the peak of the misclassified quasars lies in the range 2.5 to
4. The highest peak of the redshift distribution for
misclassified quasars occurs at z ~ 2.8, which is exactly the
redshift range in which the distinction between M stars and
quasars becomes problematic when the Sloan photometric system
is used. The misclassification simply indicates that, no matter
what the classification method is, one is prone to the same
biases because of the very nature of the objects. That the peak
of the misclassified quasars' r-band magnitude is faint is
again due to the fact that the magnitude limit of the
spectroscopic sample was fainter for higher redshift objects.
In addition, we investigate the classification results as a
function of magnitude, as shown in Figure 4. Figure 4 clearly
shows that the peak of the misclassified quasars or stars
(right panel in Figure 4) shifts toward fainter magnitudes
compared to that of the quasars and stars (left panel in
Figure 4). In other words, faint sources are more likely to be
misclassified, which possibly results from the small sample
size and low S/N ratio of these faint sources. To understand
why the misclassified sources are prone to misclassification,
we look up the misclassified quasars and stars in the SIMBAD
astronomical database and the NASA/IPAC Extragalactic Database
(NED). Of the misclassified stars, most are CV stars, white
dwarfs, RR stars and carbon stars; some are ultraviolet
sources, X-ray sources, radio sources, blue sources and HII
regions; some are galaxies and irregular spirals; and a few are
quasars. Of the 893 misclassified quasars, most are quasars
with faint magnitudes; some are AGN, Seyfert 1, Seyfert 2,
damped Lyman absorption systems and radio sources; 171 are
unidentified quasars; and a small number are 4 white dwarfs,
one CV star, one AM star and 29 galaxies.

Table 8. The comparison of different n or (γ and c) using 10-fold cross-validate method with four model colors (u - g, g - r, r - i, i - z)
and model r magnitude as input pattern.

n or (γ, c)      Acc           Acc+          Acc-          Precision     FM            GM            WA
                 (%)           (%)           (%)           (%)           (%)           (%)           (%)
n = 13           97.40 ± 0.23  96.47 ± 0.31  98.06 ± 0.34  97.24 ± 0.48  97.65 ± 0.41  97.26 ± 0.23  97.26 ± 0.23
n = 11           97.42 ± 0.25  96.53 ± 0.31  98.06 ± 0.35  97.24 ± 0.48  97.65 ± 0.41  97.29 ± 0.24  97.29 ± 0.24
n = 9            97.45 ± 0.23  96.58 ± 0.28  98.07 ± 0.34  97.26 ± 0.47  97.66 ± 0.41  97.32 ± 0.22  97.32 ± 0.22
n = 7            97.44 ± 0.23  96.60 ± 0.30  98.03 ± 0.34  97.21 ± 0.47  97.62 ± 0.41  97.31 ± 0.22  97.32 ± 0.22
c = 1, γ = 5     97.64 ± 0.20  96.76 ± 0.26  98.26 ± 0.34  97.52 ± 0.47  97.89 ± 0.40  97.50 ± 0.19  97.51 ± 0.19
c = 1, γ = 8     [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
c = 1, γ = 10    97.59 ± 0.22  96.50 ± 0.26  98.37 ± 0.34  97.66 ± 0.48  98.01 ± 0.41  97.43 ± 0.21  97.43 ± 0.21
c = 1, γ = 2     97.46 ± 0.19  96.63 ± 0.26  98.04 ± 0.32  97.22 ± 0.43  97.63 ± 0.38  97.33 ± 0.17  97.34 ± 0.17
c = 5, γ = 5     97.65 ± 0.22  96.97 ± 0.24  98.14 ± 0.35  97.37 ± 0.35  97.75 ± 0.42  97.55 ± 0.20  97.56 ± 0.20
c = 5, γ = 8     97.61 ± 0.26  97.40 ± 0.28  98.17 ± 0.39  97.40 ± 0.54  97.78 ± 0.46  97.49 ± 0.24  97.50 ± 0.24
c = 5, γ = 10    97.54 ± 0.26  96.66 ± 0.31  98.17 ± 0.37  97.39 ± 0.52  97.78 ± 0.45  97.41 ± 0.25  97.41 ± 0.25
c = 10, γ = 5    97.61 ± 0.26  96.96 ± 0.27  98.07 ± 0.41  97.27 ± 0.56  97.67 ± 0.48  97.51 ± 0.24  97.51 ± 0.24
c = 10, γ = 8    97.50 ± 0.26  96.74 ± 0.30  98.03 ± 0.40  97.20 ± 0.55  97.61 ± 0.48  97.38 ± 0.25  97.39 ± 0.25
c = 10, γ = 10   97.40 ± 0.25  96.53 ± 0.32  98.02 ± 0.36  97.19 ± 0.50  97.60 ± 0.43  97.27 ± 0.24  97.28 ± 0.24
c = 10, γ = 2    97.65 ± 0.22  97.00 ± 0.28  98.10 ± 0.35  97.31 ± 0.49  97.70 ± 0.42  97.55 ± 0.20  97.55 ± 0.20
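
The kind of binned comparison shown in Figures 3 and 4 can be sketched as follows: objects are grouped into bins of redshift (or r magnitude) and the misclassified fraction is computed in each bin. The function name, the bin edges and the toy values are assumptions for illustration; they are not the data behind the figures.

```python
from typing import List, Sequence, Tuple


def misclassified_fraction_by_bin(
    values: Sequence[float],
    misclassified: Sequence[bool],
    edges: Sequence[float],
) -> List[Tuple[float, float, int, int, float]]:
    """Return (lo, hi, n_total, n_misclassified, fraction) per half-open bin.

    values holds one quantity per object (e.g. redshift or r magnitude),
    misclassified flags the objects the classifier got wrong, and edges
    are ascending bin edges; each bin is lo <= value < hi.
    """
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [m for v, m in zip(values, misclassified) if lo <= v < hi]
        n_mis = sum(in_bin)
        # An empty bin has no defined fraction.
        fraction = n_mis / len(in_bin) if in_bin else float("nan")
        rows.append((lo, hi, len(in_bin), n_mis, fraction))
    return rows


# Made-up redshifts and flags (hypothetical, for illustration only).
z = [0.5, 1.2, 1.5, 1.8, 2.8, 2.9, 3.5]
wrong = [False, False, False, True, True, True, False]
for row in misclassified_fraction_by_bin(z, wrong, [0, 1, 2, 3, 4]):
    print(row)
```

Reporting the fraction per bin, rather than only the raw counts, separates a genuinely harder redshift or magnitude range from one that simply contains more objects.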



5 CONCLUSION 

In this paper we have investigated the k-dimensional tree
(kd-tree) and support vector machines (SVMs) applied to data
sets from optical and infrared catalogs (SDSS DR5 and 2MASS),
and tested them with different input patterns. We have computed
performance metrics such as precision and recall, true positive
rate and true negative rate, F-measure, G-mean and Weighted
Accuracy to evaluate the performance of the learning
algorithms. Based on these metrics, we cannot clearly tell
which method is superior: considering accuracy alone, kd-tree
and SVMs are comparable in separating quasars from stars.
Nevertheless, kd-tree is much faster than SVMs at building a
classifier. In real applications, there is one parameter to
adjust in the kd-tree method (the number of neighbors), while
there are two parameters to control in the SVM approach when
the RBF kernel function is used (γ and c). It is therefore not
easy to tune the parameters of SVMs to obtain good performance.
Given its high accuracy, fast speed and easy parameter tuning,
kd-tree may be a good choice for classification. Furthermore,
both kd-tree and SVMs show better performance with fewer input
parameters. Among the input patterns based on the three kinds
of SDSS magnitudes, the model magnitudes perform best, and the
dereddened magnitudes perform better than the PSF magnitudes.
The input pattern of four colors and the r magnitude (u - g,
g - r, r - i, i - z, r) performs better than the five
magnitudes (u, g, r, i, z). We considered more parameters from
the 2MASS catalog as extra inputs to our classifiers, but the
results are not better, which is possibly attributable to the
bright magnitude limit of J, H and K_s; however, other issues
might cause this effect, such as the low measurement precision
of the magnitudes. In the experiments, we employ the train-test
method and the 10-fold cross-validation method to create
classifiers. The results show that the cross-validation method
is superior to the train-test method because the former avoids
the dependence on a single random selection of the sample. When
the data become more complete or the quality and quantity of
the data improve further, the performance of the classifiers
will improve. These two approaches can be used to solve
classification problems faced in astronomy. The classifiers
trained by these methods can be used to classify sources with
multi-wavelength astronomical data and to preselect quasar
candidates for large surveys, such as the Chinese Large Sky
Area Multi-Object Fiber Spectroscopic Telescope (LAMOST).
Moreover, the two methods may be integrated into the data
mining toolkit of Virtual Observatories.

Figure 3. The distribution of quasars and misclassified quasars as a function of redshift z.

Figure 4. The distribution of quasars and misclassified quasars (dotted line) as well as stars and misclassified stars (solid line) as a
function of model r magnitude.